IGNITE-22459 Implement zone Raft group listener #5134
base: main
Conversation
CompletableFuture<?>[] futures = zoneTables.stream()
        .map(tbl -> {
            CompletableFuture<Void> createStoragesFuture = runAsync(
To reviewer: this change is needed because there was a bug: getOrCreatePartitionStorages returns a CompletableFuture, which we wrapped in runAsync, producing a double-wrapped future. This means we never waited for the nested future to complete.
Force-pushed bcb1d1b to e830657
clo.result(new TransactionResult(cmd.commit() ? COMMITTED : ABORTED, cmd.commitTimestamp()));
} else if (command instanceof PrimaryReplicaChangeCommand) {
    // This is a hack for tests, this command is not issued in production because no zone-wide placement driver exists yet.
Don't we need a TODO here to remove the hack later?
I don't know which ticket to use here, and I don't expect us to miss this place, because it will likely not work properly when we implement the placement driver.
Found a ticket, added a TODO
processTableSpecificCommand(tablePartitionId, clo);
} else {
    LOG.info("Message type " + command.getClass() + " is not supported by the zone partition RAFT listener yet");
Should this be a WARN?
This is from the previous implementation, and it is temporary code anyway; it will eventually become an assertion.
return Collections.singleton(value).iterator();
}

private static CommandClosure<WriteCommand> idempotentCommandClosure(CommandClosure<WriteCommand> clo) {
It seems that this is a 'no-result-propagating' closure. Why is it called idempotent?
The idea is that you can call result many times and it will not affect the state of the closure :) What do you propose?
Force-pushed e4d3fce to 277fda6
Force-pushed 277fda6 to e53d829
...ain/java/org/apache/ignite/internal/partition/replicator/raft/ZonePartitionRaftListener.java (resolved)
}

private void processTableSpecificCommand(TablePartitionId tablePartitionId, CommandClosure<WriteCommand> clo) {
    tablePartitionRaftListeners.get(tablePartitionId).onWrite(singletonIterator(clo));
It seems that we could batch consecutive closures if they belong to the same table in the original zone closure.
Can you please elaborate? Do you mean that instead of singletonIterator we could use an iterator with multiple commands, grouped by partition ID? But what's the point, if all listeners just call forEach inside anyway? To save on map.get calls? Or do you mean we can parallelize the processing?
> Or do you mean we can parallelize the processing?

I thought about that, and it's a non-trivial, though useful, optimisation, applicable in the case of non-full transactions. So, not for now.

> To save on map.get calls?

Yep, plus iterator instantiations. If we know that a consecutive run of commands in the iterator belongs to the same table, we may batch them.
OriginalIterator: zoneCmd0, tbl1Cmd1, tbl1Cmd2, tbl2Cmd3, tbl1Cmd4, tbl1Cmd5, tbl1Cmd6
tbl1Cmd1 and tbl1Cmd2 may be "converted" to iterator 1, tbl2Cmd3 to iterator 2, and tbl1Cmd4, tbl1Cmd5, tbl1Cmd6 to iterator 3.
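A rough sketch of that grouping, using simplified stand-in types rather than the real Ignite classes:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

class BatchingSketch {
    // Stand-ins: a command tagged with the table partition it targets.
    record Cmd(String tablePartitionId, String payload) {}

    record Batch(String tablePartitionId, List<Cmd> commands) {}

    // Splits the original iterator into runs of consecutive commands that
    // target the same table partition; each run then needs only one map
    // lookup and one iterator when handed to the matching table listener.
    static List<Batch> batchByPartition(Iterator<Cmd> original) {
        List<Batch> batches = new ArrayList<>();
        Batch current = null;

        while (original.hasNext()) {
            Cmd cmd = original.next();

            if (current == null || !current.tablePartitionId().equals(cmd.tablePartitionId())) {
                current = new Batch(cmd.tablePartitionId(), new ArrayList<>());
                batches.add(current);
            }

            current.commands().add(cmd);
        }

        return batches;
    }
}
```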
> OriginalIterator: zoneCmd0, tbl1Cmd1, tbl1Cmd2, tbl2Cmd3, tbl1Cmd4, tbl1Cmd5, tbl1Cmd6
> tbl1Cmd1 and tbl1Cmd2 may be "converted" to iterator 1, tbl2Cmd3 to iterator 2, and tbl1Cmd4, tbl1Cmd5, tbl1Cmd6 to iterator 3.

Yep, I thought of the same thing. Ok, I'll try to implement it.
I've tried; it's really too much code, so I can do it in a separate ticket. Also, I don't think it's worth making the code more complex right now.
clo.result(new TransactionResult(cmd.commit() ? COMMITTED : ABORTED, cmd.commitTimestamp()));
} else if (command instanceof PrimaryReplicaChangeCommand) {
    // This is a hack for tests, this command is not issued in production because no zone-wide placement driver exists yet.
Do you need it only in order for implicit transactions to work? The ones that you've introduced in ItZoneDataReplicationTest.java.
Mostly for single puts, but it makes implicit transactions work as well
In that case, please remove it for now. It'll be covered with another ticket.
What about single puts, how am I supposed to test them?
synchronized (commitedConfigurationLock) {
    currentCommitedConfiguration = config;

    tablePartitionRaftListeners.values().forEach(listener -> listener.onConfigurationCommitted(config));
Curious whether it's reasonable to parallelise this. For now, I'd leave it simple though.
@@ -124,7 +124,7 @@ public class PartitionListener implements RaftGroupListener {

     private final UUID localNodeId;

-    private Set<String> currentGroupTopology;
+    private volatile Set<String> currentGroupTopology;
Why? Is it possible to touch it without raft synchronisation?
This is a little tricky. Previously, this variable was only written in onConfigurationCommitted and read in onWrite, which means it was only accessed by the single Raft write thread (onConfigurationCommitted is called from this thread, right?). Now we have a new entry point: onConfigurationCommitted can be called by the aggregate zone-level listener when we add a new table-level listener to it, which can happen in an arbitrary thread. I think we could still avoid volatile here, because:

- If onConfigurationCommitted was called from the Raft thread, then the next read would be from the Raft thread, so no synchronization is needed;
- If onConfigurationCommitted is called when we are adding a new listener, then there will be a put into a ConcurrentMap, followed by a get from the same map when calling onWrite.

However, these guarantees seem too brittle to me, and I'm not sure I'm not missing something here (we are still reading through a race and may be overlooking some side effects). So I decided to use a volatile here, to be safe.
Yep, sounds reasonable. Could you please add a corresponding comment in the code?
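A possible wording for that comment, sketched on a stand-in class (the Javadoc text is illustrative, not taken from the actual patch):

```java
import java.util.Set;

class PartitionListenerSketch {
    /**
     * Names of the nodes in the committed group configuration.
     *
     * <p>Volatile because, besides the single Raft write thread (which writes it in
     * onConfigurationCommitted() and reads it in onWrite()), the aggregate zone-level
     * listener may now invoke onConfigurationCommitted() from an arbitrary thread
     * when a new table-level listener is registered.
     */
    private volatile Set<String> currentGroupTopology;
}
```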
...n/java/org/apache/ignite/internal/table/distributed/replicator/PartitionReplicaListener.java (resolved)
@@ -262,70 +257,6 @@ private void stopNode(int idx) {

         nodes.remove(idx);
     }

-    @Test
-    public void testZoneReplicaListener(TestInfo testInfo) throws Exception {
Why is it removed? It's not the equivalent of testReplicationOnAllNodes.
How is it different? We insert data and then read it back from the primary replica...
Also, note that this test uses single-puts =)
KeyValueView<Integer, Integer> kvView2 = node.tableManager.table(TEST_TABLE_NAME2).keyValueView(Integer.class, Integer.class);

// Test single insert.
kvView1.put(null, 42, 69);
Please use runInTransaction instead. Full transactions aren't supported yet, and the way you've mocked PrimaryReplicaChangeCommand, along with pseudo-working single puts, brings more mess. It'll be a definition of done for PrimaryReplicaChangeCommand for single puts to work properly.
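For reference, the suggested pattern would look roughly like this (a sketch assuming the public IgniteTransactions#runInTransaction and KeyValueView#put APIs):

```java
import org.apache.ignite.table.KeyValueView;
import org.apache.ignite.tx.IgniteTransactions;

class ExplicitTxSketch {
    // The same insert as kvView1.put(null, 42, 69), but wrapped in an
    // explicit transaction instead of relying on an implicit one.
    static void insert(IgniteTransactions transactions, KeyValueView<Integer, Integer> kvView1) {
        transactions.runInTransaction(tx -> kvView1.put(tx, 42, 69));
    }
}
```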
> along with pseudo-working single puts brings more mess

But I still need working single puts; what's wrong with implicit transactions? I can add tests with explicit transactions as well, that makes more sense to me.
> But I still need working single puts

Why?
My point here is that you don't need implicit transactions in order to test your ticket. Implicit transactions will properly work after we add a proper PrimaryReplicaChangeCommand, which in its turn will happen after the introduction of the zone placement driver.
> But I still need working single puts
>
> Why?

Because I need to test that both UpdateCommand and UpdateAllCommand are replicated. How can I do that without single puts?

> My point here is that you don't need implicit transactions in order to test your ticket. Implicit transactions will properly work after we add a proper PrimaryReplicaChangeCommand, which in its turn will happen after the introduction of the zone placement driver.

I agree, the current PrimaryReplicaChangeCommand is needed mainly for single-put support.
...tionTest/java/org/apache/ignite/internal/partition/replicator/ItZoneDataReplicationTest.java (resolved)
@@ -210,6 +212,8 @@ public class PartitionReplicaLifecycleManager extends

     /** Configuration of rebalance retries delay. */
     private final SystemDistributedConfigurationPropertyHolder<Integer> rebalanceRetryDelayConfiguration;

+    private final ConcurrentMap<ZonePartitionId, ZonePartitionRaftListener> zonePartitionRaftListeners = new ConcurrentHashMap<>();
It's an encapsulation leak; we should have Raft listeners inside replicas. Why do you need it?
I agree that this is somewhat of an encapsulation leak, but I think that code like this is even worse:

var zonePartitionReplicaListener = (ZonePartitionReplicaListener) replicaFut.join().listener();
zonePartitionReplicaListener.addTableReplicaListener(tablePartitionId, tablePartitionReplicaListenerFactory);

In my case I would also have to add special methods to access these listeners. Moreover, the lifecycles of the Raft and replica listeners are closely related: they are always added and removed together, so I would even prefer to have both of them in this map, which would make them much easier to manage. WDYT?
I'd rather add something like an addTableProcessor method to the replica in order to substitute

((ZonePartitionReplicaListener) replicaFut.join().listener()).addTableReplicaListener(tablePartitionId, createListener);

with something like

replicaFut.join().addTableProcessor(tablePartitionId, createListener, tableRaftListener)

in PartitionReplicaLifecycleManager#loadTableListenerToZoneReplica. addTableReplicaListener should be removed in that case.
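A hypothetical shape for that method, with stand-in types (none of these signatures exist yet; this just illustrates keeping both listener registrations behind the replica):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Stand-ins for the real listener types and partition ID.
interface TableReplicaListener {}

interface TableRaftListener {}

record TablePartitionId(int tableId, int partitionId) {}

class ReplicaSketch {
    private final Map<TablePartitionId, TableReplicaListener> replicaListeners = new ConcurrentHashMap<>();
    private final Map<TablePartitionId, TableRaftListener> raftListeners = new ConcurrentHashMap<>();

    // Registers both listeners in a single call, keeping their shared
    // lifecycle in one place and hiding the maps inside the replica.
    void addTableProcessor(TablePartitionId id, TableReplicaListener replicaListener, TableRaftListener raftListener) {
        replicaListeners.put(id, replicaListener);
        raftListeners.put(id, raftListener);
    }
}
```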
Makes sense. But this is not possible with the current code state: if we move this logic inside Replica, then we need to be able to access ZonePartitionReplicaListener inside it, but that class is in another module, which is not accessible. We could overcome this by introducing something like ZoneReplica and placing it in the correct module, but this would also mean we need a ZoneReplicaManager and so on.
https://issues.apache.org/jira/browse/IGNITE-22459
Thank you for submitting the pull request.
To streamline the review process of the patch and ensure better code quality,
we ask both the author and a reviewer to verify the following:
The Review Checklist
- There is a single JIRA ticket related to the pull request.
- The web-link to the pull request is attached to the JIRA ticket.
- The JIRA ticket has the Patch Available state.
- The description of the JIRA ticket explains WHAT was made, WHY and HOW.
- The pull request title is treated as the final commit message. The following pattern must be used: IGNITE-XXXX Change summary, where XXXX is the JIRA issue number.
Notes